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Abstract — We propose a distributed algorithm, named D- 
ADMM, for solving separable optimization problems in networks 
of interconnected nodes or agents. In a separable optimization 
[ problem, the cost function is the sum of all the agents' private 
cost functions, and the constraint set is the intersection of all 
the agents' private constraint sets. We require the private cost 
function and constraint set of a node to be known by that 
node only, during and before the execution of the algorithm. 
The application of our algorithm is illustrated with problems 
from signal processing and control, namely average consensus, 
compressed sensing, and support vector machines. It is well 
known that communicating in distributed environments is the 
most energy/time-demanding operation. Thus, algorithms using 
less communications are more prone to make networks live 
longer, e.g., sensor networks, or to execute faster, e.g., in super- 
computing platforms. Through simulations for several network 
types and problems, we show that our algorithm requires less 
communications than the state-of-the-art algorithms. 

Index Terms — Distributed algorithms, alternating direction 
method of multipliers, consensus, compressed sensing, machine 
I learning, sensor networks. 



I. Introduction 

Recently, there has been a growing interest in distributed 
methods for data processing. Such interest was triggered both 
by the need of processing large quantities of data in distributed 
platforms, such as supercomputers, and by applications where 
data comes from spatially different locations and central 
processing is costly or impractical, as in sensor networks. 
On the other hand, over the last decades, convex optimization 
theory has proven to be an invaluable tool for deriving data 
processing algorithms |1|. Standard algorithms like interior- 
point methods are, however, unable to handle either large-scale 
or distributed problems. To solve these problems we thus need 
new algorithms. 

In this paper, we aim to solve separable optimization 
problems in a distributed, decentralized way. In a separable 
optimization problem, the cost function can be decomposed 
as a sum of P functions fp and the constraint set can be 
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Figure 1. Network with P = 7 nodes. Node p only knows fp and Xp, but 
cooperates with its neighbors in order to solve (T). 

decomposed as the intersection of P sets Xp-. 

minimize /i (x) + /2 (x) H + fp{x) 

^ (1) 

subject to X e Xi n X2r] ■ ■ ■ n Xp , 

where x e K" is the global optimization variable. We 
will denote any solution of ([U with x*. We associate with 
problem ([TJ a network of P nodes, where only node p 
has access to its private cost function fp and to its private 
constraint set Xp. This situation is illustrated in Figure [T] Each 
node can communicate with its neighbors using the network 
infrastructure. A method that does not use any kind of special 
or central node will be called distributed algorithm. Thus, in a 
distributed algorithm, aggregating data is not allowed; instead, 
each node has to exchange information only with its neighbors 
in order to achieve the common goal: arriving at a solution 
of ([T]i. We propose a novel distributed algorithm based on 
the alternating direction method of multipliers (ADMM) for 
solving ([T]i- 

In applications such as sensor networks, nodes are battery- 
operated devices and thus energy is a scarce resource. It is 
well known that communication is the most energy-consuming 
operation in such a device |2|; therefore, algorithms using 
less communications can increase the lifespan of the network. 
Similarly, communication is known to be the algorithmic 
bottleneck in applications with more controlled environments, 
for example, in network protocols |3| or supercomputing 
platforms ID. Algorithms using few communications are thus 
required for such applications. 

We use our proposed distributed algorithm to solve several 
instances of ^ from several areas in engineering, namely 
average consensus, compressed sensing problems, and support 
vector machines. Our simulations show that, for each of these 
problems and for almost all network types, our proposed 
algorithm requires significantly less communication than prior 
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State-of-the-art algorithms, including those that are specifically 
designed for a particular problem and are not applicable to the 
entire problem class ([T]i. 

Next we formally state the problem we solve; then we give 
an overview of related work. 

Formal problem statement. Given a network with P 
nodes, we associate fp and Xp in ^ with the pth node of the 
network, p = 1, . . . ,P. We make the following assumptions: 

Assumption 1. Each fp : R" M. is a convex function 
over M", and each set Xp is closed and convex. 

Assumption 2. Problem ([T]| is solvable. 

Assumption 3. The network is connected and its topology 
does not vary with time. 

Assumption [T] involves the concepts of convexity of a func- 
tion and convexity of a set. We assume the reader is familiar 
with these concepts, as well as with Lagrangian duality 
||5l . The second assumption states that there is at least one 
vector X* that solves ([T]). This implies that the optimal cost 
r - Ep=i fp{x*) is finite (and that C^P^^Xp ^ 0). In 
Assumption [3] a network is connected if there is a path 
between every pair of nodes. 

Under the previous assumptions, we solve the following 
problem: given a network, design a distributed algorithm that 
solves ([T]). By distributed algorithm, we mean that there is 
no notion of a central or special node; also, no node in the 
network, except node p, has access to fp or Xp at any time 
before or during the algorithm. 

Related work. The first method addressing the generic 
problem ([T]l was based on an average consensus algorithm 
coupled with a subgradient method 161, Q. Such method has 
been extensively studied and is known to be robust to noise and 
link failures; however, it generally requires too many iterations 
to converge. 

Another approach for solving ([T]) in a distributed way is 
via augmented Lagrangian methods, which consist of two 
loops: an outer loop updating the dual variables, and an inner 
loop updating the primal variables. There are some options 
for the algorithms in each loop. For the outer loop, the 
most common is the gradient method fSl, f9l, ifTOl , which 
degenerates in the so-called method of multipliers ifTTI . For 
the inner loop, common choices are the nonlinear Gauss- 
Seidel method |9|, |10| and Jacobi type algorithms, such as 
the diagonal quadratic approximation lH. In | fT2l| we applied 
fast gradient algorithms ifTSl in both loops. We mention that, 
although fcSl, IT2I do not address ([T]i specifically, they can 
be easily adapted to solve it, as explained in section |III1 

A special augmented Lagrangian-based algorithm, which 
uses one loop only, is the alternating direction method of 
multipHers (ADMM) IH, d, d. In fact, ADMM is the 
method of multipliers concatenated with one iteration of the 
nonlinear Gauss-Seidel, and therefore, it consists of just one 
loop. ADMM is not directly applicable to ([T]): one has to 
reformulate that problem first. There are several options for 
such reformulation. For example, |16| proposes a distributed 
algorithm for solving specific instance of ([T]i, based on a 
particular reformulation (the works 03, HI, US, ||20l 



apply the same algorithm to other instances of OJ). Another 
possible reformulation of ([1]) is explored in ICTII. l22l . which 
also proposes an ADMM-based algorithm, but in contrast 
with lT6l . it uses two communication steps per iteration, or 
in other words, each node uses information from its second- 
order neighborhood. A comparison between [16,1 and l2T]| 
for the average consensus problem is provided in l20ll ; it 
is reported that both methods have similar convergence rates 
(and thus 1.2 1 1 uses twice as much communications as [T6|), 
while 1T6| shows more resilience to noise. The algorithm 
we propose in this paper is also based on ADMM, but 
applied to a new reformulation of ([TJ. We show through 
extensive simulations that the proposed algorithm requires less 
communications than any of the previous approaches. 

All the above algorithms can solve ([T]i in a distributed way 
and will be briefly described in section|IIIl There are, however, 
other algorithms that can solve ([T]), but are not distributed. 
For example, [23 1 applies ADMM to distributed scenarios, 
but the resulting algorithms are not distributed in the sense 
that they require a central node or a special network topology. 
Our algorithm and the ones described before, in contrast, are 
decentralized and can run on any connected network topology. 

Contributions. We propose a distributed algorithm for solv- 
ing ([T]). This algorithm is an extension of our prior work |[24l . 
where we solved a particular instance of ([T]), namely Basis 
Pursuit [25|. Here, we first show that the algorithm in l24ll 
can be generalized to the problem class ([T]i. Then we apply 
it to following problems: average consensus, two compressed 
sensing problems (different from the Basis Pursuit), and sup- 
port vector machines. We also provide extensive simulations 
of several distributed algorithms solving ([T]). Our simulations 
show that the proposed algorithm outperforms, in terms of the 
number of used communications, any of the previous algo- 
rithms. One of the above problems, the average consensus, has 
been extensively studied due to its importance and simplicity. 
For example, many distributed algorithms, which cannot be 
trivially generalized to solve ([T]i, have been proposed to solve 
it. Our algorithm, designed to solve the generic problem ([T]), 
can also solve the average consensus problem and, as our 
simulations show, it outperforms a state-of-the-art consensus 
algorithm |26|, for many networks of interest. 

Organization. The paper is organized as follows. In sec- 
tion we derive the algorithm, by manipulating the prob- 
lem and then applying ADMM. Section |III] is dedicated to 
related algorithms; in particular, we describe the competing 
algorithms and show how we compare them with ours. In 
section |IV] we take problems from several areas and see 
how they can be recast as ([T]i- Additionally, we analyze the 
resulting algorithm for each one of these problems. Finally, in 
section[Vl we describe and show the results of our simulations. 
In Appendix |A] we provide a proof of convergence of the 
particular version of ADMM we use, making this paper self- 
contained. 

II. Algorithm Derivation 

In this section, we propose a distributed algorithm for 
solving ([T]i. We begin by introducing some graph theory 
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concepts and notation. Next, we address the scenario of a 
bipartite network, where a proof of convergence is given and 
the exposition is simpler; then, we generalize the algorithm for 
arbitrary networks. In this case, although we do not provide 
a proof of convergence, the algorithm always converges in 
practice, as shown in section IV] 

Network notation. We represent a network as an undirected 
graph Q = (V, £), where V = {1,2,..., P} is the set of nodes 
and £ C V X V is the set of edges. The cardinality of these sets 
is represented respectively by P, the number of nodes, and E, 
the number of edges. An edge is represented by {i, j} — {j, i}, 
and G £ means that nodes i and j can exchange data 

with each other. We define the neighborhood Mp of a node p 
as the set of nodes connected to node p, but excluding it; the 
cardinality of this set. Dp \Afp\, is the degree of node p. 

Node coloring. Given a network Q, a (node) coloring 
scheme is an assignment of numbers, which we call colors, 
to the nodes of the network such that no adjacent nodes have 
the same color The colors wiU be represented as a set C = 
{1, 2, . . . , C}, where C is the total number of colors. For each 
color c, the set Cc C V represents the nodes with color c, 
and Cc is the number of nodes with that color, i.e., \Cc\ — Cc- 
A network Q can have several coloring schemes, and x{G) 
denotes the minimum number of colors required to color it; 
it is called its chromatic number. Finding a coloring scheme 
that uses only x{G) colors is NP-hard for x{G) > 2. Node- 
coloring is very important in wireless networks, since it is used 
in some medium access (MAC) protocols; consequently, many 
distributed node-coloring algorithms have been proposed, e.g., 
IE? I, 1 28 1, 1 29 1, |30|. Henceforth, we will assume that the 
network Q is given together with a coloring scheme C. 

Problem manipulations. We start by assuming the net- 
work Q is bipartite or, equivalently, that x{G) — 2, i.e., it 
can be colored with two colors. An example of a bipartite 
network is a grid. We start with this assumption for two 
reasons: the exposition is simpler, and we prove that our 
algorithm converges for this case. Without loss of generality, 
assume the nodes are ordered such that Ci = {1, 2, . . . , Ci} 
and C2 = {Ci + l,Ci + 2,...,C}. 

A common technique to decouple problem ([T]i is to assign 
copies of the global variable x to each node and then constrain 
all copies to be equal. Denoting the copy held by node p 
with Xp e M", problem (|D is written equivalently as 

minimize fi{xi) + f2{x2) H h fp{xp) 

subject to Xp e Xp , p — 1, . . . , P (2) 

Xi — Xj , {j, j} G f , 

where x = {xi, . . . , xp) € (M")-^ is the optimization variable. 
Problem (|2|i is no longer coupled by the intersection of the 
sets Xp or by a common variable in all /p's, as ([T|), but instead 
by the new equations Xi — Xj, for all pairs of edges in the 
network £ £. These equations enforce all copies to be 

equal since the network is connected, due to Assumption |3] 
Note that they can be written more compactly as {B^ ®In)x = 
0, where B G M^^^ is the node arc-incidence matrix of the 
graph, /„ is the identity matrix in R", and ® is the Kronecker 
product. Each column of B is associated with an edge {i, j} G 



£ and has 1 and —1 in the ith and jth entry, respectively; 
the remaining entries are zeros. We can partition B into two 
matrices Bi and B2, where Bi contains the first Ci rows 
and B2 contains the remaining rows. Therefore, {B^ ®In)x = 
[BJ (g) /„)xi + {BJ (g) In)x2, where xi = (xi, . . . , xcj G 
(R")*^! and X2 = {xc,+i, ■ ■ ■ ,xc) G This enables 

rewriting (|2|i as 

minimize Y.pec^ fpi^p) + Epsc^ fp(^p) 
subject to xi G Xi , X2 G X2 

{Bj (g) /„)xi + {Bj (g In)x2 = , 

where Xi = Hpec -^p^ ^'^^ i = 1,2. Problem (O can be solved 
with ADMM, explained next. 

ADMM. The alternating direction method of multipliers 
(ADMM) was proposed in the seventies by ^141 . ifTsl . Given 
two functions gi and g2, two sets Xi and X2, and two 
matrices Ai and A2, ADMM solves 

minimize gi (xi ) + 32 (a;2 ) 

subject to xi E Xi , X2 & X2 
A1X1+A2X2 =0. 

It uses the method of multipliers concatenated with an iteration 
of the Gauss-Seidel algorithm ifTTl . i.e., it iterates on k: 

G arg min Lp{xi,X2]\^) (5) 

xieXi 

x^-^^ eaig min Lp{x\+'^,X2]\'') (6) 

X2eX2 

A^+i =A'^+p(Aix^+i+A2a;^+i), (7) 

where 

Lp{xi,x2; A) := gi{xi) +32(2:2) + \^ {Aixi + A2X2) 

+ ^11^1x1+^22:211' (8) 

is the augmented Lagrangian of (|4|i, A is the dual variable, 
and p > is a predefined parameter. More information 
about ADMM, including variations, applications, and related 
algorithms can be found, for example, in ifTTI . Il23l . ||3T]| . The 
following theorem establishes its convergence. 

Theorem 1 dUIl, ES). Assume gi : R"' R is a convex 
function over R"% Xi C R"* is a closed convex set, and Ai 
is a full column-rank matrix, for i = 1,2. Also, assume 
problem dU is solvable. Then, the sequence {(xi , X2, A*^)} 
generated by Q-© converges to (x\, X2, A*), where 

1) (xi, xJ) solves (IHl 

2) A* solves the dual problem of 

minimize Gi(A)+G2(A), (9) 

A 

where Gi{X) = vnixiexA9i{xi) + \^ AiXi), i = 1,2. 

This theorem is more general than the versions in ||231 
and till Prop.4.2]. Namely, 1231 only proves objective con- 
vergence, but since it does not assume Ai and A2 to be 
full column-rank, the sequence {{x\, 2:2, A'^)} might not even 
converge. On the other hand, [1 1 1 proves claims 1) and 2) only 
for the special case when one of the yl^'s is the identity. Since 
we could not find a proof of this version of the theorem in 



4 



literature, we provide it in Appendix lAl The proof is, however, 
very similar to the ones in IfTTI . ||23l. 

Applying ADMM. Clearly, (|3]l has the same format as Q: 
just make the following associations 

g^ix^) = J2 -fp^^P^ ' ^^=^^, A, = BJ I„ , (10) 
pGC, 

for i — 1,2. The following theorem guarantees that, under 
Assumptions [T]|3] problem (O satisfies all the conditions of 
Theorem [T] thus, we can use ADMM to solve (|3]l. 

Theorem 2. Let Assumptions |7]|5] hold and consider prob- 
lem (|3]l. Then, for i — 1,2, the function '^p^c- fp^^p) 
convex over R", the set Xi is closed and convex, and the ma- 
trix BJ (g) In has full column-rank. Furthermore, problem (O 
is solvable. 

All conclusions of Theorem |2] except that Bj ® I„ has 
full column-rank, are derived straightforwardly Assumptions [T] 
and 12] and the equivalence between (|3]l and To see how 
Assumption [3] implies that each Bj ®In has full column-rank 
for i = 1,2, note that it is sufficient to prove that Bj has full 
column-rank. If we prove that BiBj has full rank, then the 
results follows because r:?mk{BiBj) — rank(i?j^). Now note 
that BiBJ is a diagonal matrix, where in the diagonal are the 
degrees of the nodes belonging to the subnetwork composed 
by the nodes in Ci. Assumption |3] states that the network is 
connected and thus no node can have degree 0; consequently, 
B,Bj has full rank. 

Algorithm for bipartite networks. We now apply ADMM, 
i.e., ©-([Til directly to Q. By analyzing the augmented 
Lagrangian (O, we will see that (|5]i yields Ci optimization 
problems that can be solved in parallel. Similarly, (|6]l yields C2 
optimization problems that can be solved in parallel. In fact, 
developing the squared term in ([8]l we obtain for the xi- 
minimization (|5]i: 

Xi^"^ e arg min gi{xi) + f]J xi + ^xj {Al AijXi , (11) 

where fyi := A^}^ + pi^A^ A'2)x\. Using ( flOl i, the quadratic 
term in (fTTT i becomes {p/2)xj {BiBj (g) As we had 

seen before, BiBj is a diagonal matrix with the degree of 
node p. Dp, in the (pp)th entry. This quadratic term can then 
be rewritten as {p/2)J2peCi -^pW^pW^- Regarding the linear 
term, 

fjjxi = {{Bi® Ir,)\^ + p{BiBj ® Ir,)xlY Xi 

(12) 

where sign(w) = 1 if w > and sign(ti;) = —1 other- 
wise. We decomposed the dual variable A as (. . . , ^{i.j}, ■ ■ ■), 
where X^i j^ = A^j ^i, is associated with the edge {i,j} G £. 
To see why ST% holds, note that the (ij)th entry of BiBj 
is —1 if {i,j} G £ and otherwise. In conclusion, using JTOl i 
on (fTTT i. we get Ci optimization problems that can be solved 



in parallel: 

x';+^ e arg min fpixp)+i-ff -p ^ x^Vxp+^\\xp\\^ 

where 7^ EjeA^p sign(j - P)>\p^jy for P e Ci, i.e., 
all nodes with color 1. A similar analysis for (|6]l yields the 
same problem for each node in C2, but using the neighbors' 
estimates at fc + 1. 

Note that the optimization problem each node solves de- 
pends on the dual variable A through 7^. In fact, we can 
derive a formula for updating 7^ from (|7]l : write (|7]l edge- 
wise, A;^_+J) = A;t.^^.j + psign(j - p){x^+' - x]+'), and 
insert this formula in the definition of 7p; the result is 

7p+^ = Ip + pJ2jeJ^pi^p'^^ ~ ^j) ■ In sum, we arrive at 
the following algorithm, named D-ADMM, after Distributed- 
ADMM. 



Algorithm 1 D-ADMM for bipartite networks 

Initialization: for all p G V, set 7p = a;^ = and fc = 1 
1: repeat 

2: for all p £ Ci [in parallel] do 

3: SetVp - pY^ J ^j^^x'] and find 

Xp = argmin Jp(xp) + Vp Xp H f-||.'Cp|| 

s.t. Xp e Xp 

4: Send Xp+'^ to Mp 

5: end for 

6: Repeat l2l5l for all p G C2, replacing x'] by 2;*^+^ in Vp 

1: for all p £ V [in parallel] do 

8: end for 

9: fc ^ fc + 1 

10: until some stopping criterion is met 



In Algorithm [T] there are two groups of nodes: Ci and C2. 
In each group, the nodes are not neighbors between them- 
selves and operate at the same time. In particular, one node 
receives a;^ or x^'^^ from all its neighbors and then solves the 
optimization problem of step |3] finding its estimate 2;^+^; it 
then broadcasts x^^^ to its neighbors. After having received 
all the estimates from its neighbors, node p can update 7p as 
in step|7] 

Algorithm [1] only applies to bipartite networks and its 
convergence is guaranteed by Theorems [1] and |2] Next, we 
generalize it for any type of network. 

Algorithm for arbitrary networks. So far we assumed 
the network was bipartite, i.e., it could be colored with 
just two colors: x{G) ^ 2. Now, suppose the network 
is colored with C > 2 colors. We derived problem (|2| 
from problem (O by partitioning the node-arc incidence 
matrix B into Sj] , where Bi and B2 contain the 

rows corresponding to nodes in Ci and C2, respectively. For 
convenience, we had assumed the nodes were ordered such 
that Ci = {1, 2, . . . , Ci} and C2 = {Ci + 1, Ci + 2, . . . , C}. 
Without loss of generality, assume a similar ordering for the 
nodes in the current case, i.e., Ci = {1, 2, . . . , Ci}, C2 = 
{Ci + 1, . . . , C1+C2}, ...,Cc^ {EtY Ce+1, ...,C}. This 
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induces a natural partition of B as [BJ BJ ■ ■ ■ i?^] . 
The counterpart of Q for this case is then 

ininimize J2c=i Ep<^c, Ui^v) 

subject to Xc&Xc, c=l,...,C (13) 

where the variable is (xi, . . . , Sc) G (M")^. To solve ( fT3] l, 
we cannot apply ADMM (|5]l-(l7]l directly, but its generaUzation 
to solve 

minimize gi{xi) H + gc{xc) 

xi,...,xc 

subject to Xc G Xc , c = 1, . . . , C 

Aixi H VAcxc = 0. 

Such a generalization is straightforward, but there are no con- 
vergence guarantees in the sense that there is no known proof 
of Theorem [T] for the resulting algorithm. Yet, experimental 
results, as the ones presented here, suggest that Theorem [T] 
might still hold. 

The following algorithm is a straightforward generaUzation 
of Algorithm [T] to solve ( fTST l. 

Algorithm 2 D-ADMM for general networks 
Initialization: for all p G V, set 7^ = a;^ = and fc = 1 



repeat 

for c = 1, . . . , C do 

for all p € Cc [in parallel] do 

k k k+1 k 

Vp - P 2^ -p ^ Xj 

jeMp jeJ\fp 
3<p i>p 

and find 



argmin fp{xp) + Vp^ Xp + \\xp 
s.t. Xp e Xp 



Send a;p+^ to Afp 
end for 
end for 

for all p G V [in parallel] do 

k + l fc I / fc + 1 k+l\ 

end for 

fc ^ fc + 1 
until some stopping criterion is met 



In Algorithm |2] there are C groups of nodes, arranged by 
colors. As in the bipartite case, in each group, the nodes are not 
neighbors between themselves and operate at the same time. 
In practice, they cannot operate at the same time because there 
is no central coordination. As seen in the expression for 
(step |3]l, one node, knowing the color of its neighbors, can 
operate only after having received the estimates x^'^^ from 
the nodes with lower color Excepting that, the algorithm is 
very similar to Algorithm |2] 

As said before, although there is no convergence guarantee 
for Algorithm |2] it converges in practice for several kinds of 
problems, and we have never seen a nonconvergent example. 
We point out that, using a reasoning similar to that of the 
discussion after Theorem |2] each matrix Bj can be proved to 
be full column-rank. 



III. Related Algorithms 

We now review some algorithms for solving ([T]). We start 
with describing the performance measure we will use and its 
practical relevance. 

Performance measure: communication steps. We divide 
the algorithms that solve ([T]i in two kinds: single-looped and 
double-looped. The single-looped algorithms have just one 
(nested) loop, and in each iteration, each node receives the 
neighbors' estimates, computes its on estimate, and broadcasts 
it to its neighbors. The double-looped algorithms have two 
nested loops and they are based on augmented Lagrangian 
methods (see ([8])). The outer loop serves to update the dual 
variable of the augmented Lagrangian; the inner loop serves 
for each node to receive the neighbors' estimates, compute 
its own estimate, and broadcast it to the neighbors, as one 
iteration of a single-looped algorithm. The natural way of 
comparing all algorithms is then through a communication 
step, which will denote one iteration (for the single-looped 
algorithms) or one inner loop iteration (for the double-looped 
algorithms). We say that an algorithm is more efficient than 
another one if it uses less communication steps for the same 
solution accuracy. 

Communication steps is a fair measure because it compares, 
not only similar computational operations at each node (as we 
will soon see), but also the amount of communications each 
node requires to achieve a solution with a given precision. 
Furthermore, one of the main concerns in designing algorithms 
and protocols for wireless networks is reducing the number 
of communications; the reason is because communicating is 
the slowest and most energy-consuming task [2], [32 J. The 
same concern is shared in supercomputing environments H. 
In contrast with execution time, communication steps are 
application and implementation independent. For example, 
given a distributed algorithm, we get different execution times 
if we change any of the following: medium-access protocol, 
transmission medium, scheduling, platform for the nodes, or 
even the way the algorithm is coded. Communication steps 
are, on the other hand, invariant to any of these. Note that, 
given a network with E edges, if we multiply 2E by the 
number of communication steps, we obtain the total number 
of communications in the network. 

We point out, however, that our algorithm, D-ADMM, is 
asynchronous, i.e., not all nodes perform the same tasks at the 
same time, in contrast with most of the algorithms presented 
next. More concretely, D-ADMM assumes a coloring for the 
nodes, and each node operates according to its color Other 
algorithms do not rely on any coloring scheme, enabling them 
to be synchronous, i.e., all nodes can perform the same tasks 
at the same timeQ Therefore, if one communication step in 
D-ADMM takes T time units, it would take T/C units in a 
synchronous algorithm, where C is the number of colors, in 
case the nodes were able to transmit messages at the same 
time. 

Gradient/Subgradient method. The classical approach for 
solving ([T]i is to use a subgradient method combined with 

'At our abstraction level, we are overlooking packet collision and other 
medium-access problems that may potentially destroy synchronism. 
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an averaging consensus algorithm. This approach was taken 
in ||6l, and consists of iterating in k 



E 



(14) 



for each node p. In (fl4] l. [s]s represents the projection of the 
point s onto the (convex) set S, Xp the estimate of node p 
at iteration k, the gradient (or a subgradient, in case the 
objective is not differentiable) of fp at the point a^pXp + 
Tlij^M ^pi^j' '^ij represents the influence node j exerts 
on node i at iteration k. There are some conditions the 
weights a^j have to satisfy f6\, for example, for each k, the 
matrix with entries ajj should be doubly stochastic. In each 
iteration, node p receives the estimates x'j from its neighbors, 
computes (fl4] i. and then broadcasts to its neighbors. It 

is thus a (synchronous) single-looped algorithm. Although this 
algorithm is provably robust to noise, it is known to perform 
very slowly, requiring lots of iterations to converge. 

ADMM-based algorithms. We now describe two algo- 
rithms that are based on ADMM, like ours. Both of them are 
based on different reformulations of yielding two different 
algorithms. The first algorithm is lfT6l (it also appears in ifTTl . 
ifTSl . lfT9l . II20I ). described in Algorithm |3] 

Algorithm 3 lfT6l 



Initialization: for all p G V, set 7^ 



■ and k 



repeat 
for all p 

set v: 



1 

p 



, P [in parallel] do 



7p -pE 



■je^fpU{p} -^3 



Xi and find 



Cp+^ = argmin fp{xp) + Vp Xp + pDp\\xp\f 



s.t. 



Send Xp'^^ to A/'p, and receive x^'^^, j G A/'p 
end for 

for all p = 1, . . . , P [in parallel] do 



7p^+^=7p* + pE,w,(-r^- 
end for 
k ^ k+1 
until some stopping criterion is met 



Although derived from a different reformulation of ([T]i, 
Algorithm [3] algorithm is very similar to D-ADMM. The three 
main differences are: 

1) Algorithm [3] is synchronous, i.e., all the nodes can 
perform the same tasks at the same time; 

2) while in Algorithm |3] node p uses Xp to construct the 
vector Vp (see step O, D-ADMM does not; 

3) Algorithm [3] is proven to converge for any connected 
network, and D-ADMM is only proven to converge in 
bipartite networks (although in practice it converges in 
any connected network). 

We will see that D-ADMM requires almost always less com- 
munication steps to achieve a given accuracy than Algorithmic] 
for several instances of ([T]l. 

Another ADMM-based algorithm is [21] (also appearing 
in II22I, llOl). That algorithm, in contrast with D-ADMM 
and Algorithmic] uses two communication steps per iteration. 



Since the number of iterations it takes to achieve a given 
accuracy in the average consensus problem is similar to 
Algorithmic] I.20J . it takes the double of the communication 
steps. 

Double-looped algorithms. Several double-looped algo- 
rithms were proposed to solve particular optimization prob- 
lems, which as D-ADMM, can be generalized to solve ([T]|. In 
particular, they can solve ([T]), by solving the following dual 
problem of ©I 

maximize Lp{\) , (15) 

where Lp(X) is the (augmented) dual function 

Lp(A) = inf J2p=i fpi^p) + T,{i,j}e£ (^'' ^ ^i) 

s.t. XpGXp, p—l,...,P, 



(16) 

and 0(J(?7) := v^ij + f ||?7|p. It can be proved that Lp{X) is 
differentiable, and thus ( ITsT l can be solved with a gradient 
method, or even with a fast (Nesterov) gradient method. The 
outer loop consists of applying such algorithm: it updates 
the dual variables A^^ jj. However, to compute the gradient 
of Lp(A), one needs to solve the optimization problem in ( ITSl i; 
and this requires another iterative algorithm, consisting of the 
second nested loop. Several approaches have been used to 
solve that problem, e.g., nonlinear Gauss-Seidel methods |9|, 
1 10 1, Jacobi-type methods [8|, and Nesterov methods |12|. 

Of all the algorithms described in this section, only gradi- 
ent/subgradient method does not solve an optimization prob- 
lem at each iteration. All the remaining algorithms solve the 
same problem as D-ADMM (step |4] of Algorithm |2]i, but 
possibly with different parameters. As illustrated in section [Vl 
in practice, our D-ADMM performs generally better than 
any of algorithms described in this subsection. The only true 
competitor is Algorithm |3] 

IV. Applications 

We now take some optimization problems from signal 
processing and see how they can be recast as ([T]i. We also 
review some of the conventional methods to solve them in 
a distributed way. Then, in section |V] we present simulation 
results of D-ADMM and other algorithms applied to these 
problems. 

Consensus. Consensus is one of the most fundamental 
problems in networks. Given a network with P nodes, node p 
generates a number or measurement, say 9p, and the goal is 
to compute the average 9* — (1/P) X^^^i ^'^ every node. 
The classical approach to consensus ll33l . ||26| is 



Ax" 



x° = 



where the pth entry of x'^ — {x^, . 



(17) 

e K-'^ contains the 
estimate of node p at time k and Q = {9i, . . . ,6p) e is 
the vector collecting the measurements. The matrix A reflects 
the topology of the network by having = for ^ £. 

Thus, {TT} is equivalent to = I]jGA/pU{p} "■pj^'j' i-^-' 

the estimate of node p at k + 1 will be a weighted average 
of its estimate and the neighbors' estimates at k. There 
are some constraints on choosing the weights, for example. 
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'^jeAf u{p} ~ ^- ^^'^ ^ recent survey on consensus 

algorithms using the approach ( [TT] ). 

Another approach to consensus, recently adopted in l,20J . is 
to solve the optimization problem 



mmimize 



p=i 



(18) 



in a distributed way. The solution of ( fTsT i is 9* ~ 
(l/-P)Ep=i^P- Clearly, ^ has the format of O by as- 
signing fp{x) = {x — OpY, Xp — R; thus, it can be solved by 
D-ADMM or any of the methods described in the last section. 
The work |20| applies Algorithm |3] for solving (fTsT l. 
In all algorithms, node p has to solve at each iteration 

minimize (x — 6p)^ + x + cx^ , 

X 

which has the closed-form solution x* — {29p — v)/{2{l^c)). 

Compressed sensing problems. Compressed sensing is a 
new paradigm for the acquisition of signals. For an introduc- 
tory survey see, for example, f34l. In compressed sensing, 
a signal is compressed at the same time as it is acquired, 
in contrast to the classical acquire-then-compress paradigm. 
Although compressed sensing makes the acquisition process 
simpler, the reconstruction of a compressed signal is more 
complicated, since an optimization problem has to be solved. 
The goal of this optimization problem is to find the sparsest 
solution of a linear system Ax = b, where A e jj™^" is a 
matrix related with the acquisition process and b E M™ is the 
acquired/compressed signal. If A satisfies some technical con- 
ditions L34J . then the solution to the Basis Pursuit (BP) 



b, 



(19) 



mmirmze ||x||i 

X 

subject to Ax = 

recovers the sparsest solution of Ax = b. In (fT9] l, ||a;||i = 
|a;i| + • • • + \xn\ is the £i-norm of x. A variation of (fT9] l is 
the LASSO [34|: 



mmimize 

X 

subject to 



\x\\i 
\Ax ^ 



b\\<<^: 



(20) 



where cr > is a fixed parameter and || • || the ^2-norm. 
Problem (|20] | is more appropriate when there is noise in the 
acquisition process and we have an upper bound a on the 
norm of the noise. Closely related with ( l20l i. is 



mmimize 



\Ax-bf +f3\\x\U 



(21) 



called Basis Pursuit Denoising (BPDN) 11251 . The parame- 
ter /3 > regulates the tradeoff between the sparsity of the 
signal and the weight of the noise. 

Here, we are interested in solving distributed versions of 
(fT9]l- (l2Tl i. Namely, we assume the matrix A is partitioned 
into P blocks, either by rows or by columns, as depicted 
in Figure |2] We assign the pth block to the pth node of a 
network with P nodes. For applications of this problem see, 
for example, our previous work ll24l . In Il24l . we solved BP 
with D-ADMM for both the row and the column partition. In 
this paper, we solve (|20] | for the column partition, and (l2Tl i 
for the row partition. The reverse cases, i.e., (|20] | for the row 



Row Partition 

Ai 



Column Partition 



Ax A2 ■■■ Ai 



Figure 2. Row partition and column partition of A into P blocks. In the 
row pailition a block is a set of rows; in the column partition a block is a set 
of columns. 



partition or (ISTT i for the column partition cannot be trivially 
recast as ([T]i. 

LASSO: column partition. Assume the matrix A is parti- 
tioned by columns, A = [Ai ••• Ap] , where the pth block 
of A is only known at node p. Assume vector 6, parameter a, 
and the number of nodes P are available at all nodes. This 
problem cannot be directly recast as ([T]i; we will have to do 
it through duality. However, if we solve the "simple dual" 
of (|20] |. we will not be able to recover solution to the primal 
problem afterwards. The reason is that the objective of (|20] | 
is not strictly convex. We thus start by regularizing ( |20l i by 
adding a small term to its objective, making it strictly convex: 



minimize ||x||i 

X 

subject to II Ax — b\\ < cr , 



(22) 



where (5 > is sufficiently small. This regularization is 
inspired by li35J . which establishes some conditions for reg- 
ularizations of this type to be exact. By exact, we mean that 
there exists 5 > Q such that the solution of ( |22] | is always 
a solution of ( |20] |. for 5 < 5. One of the conditions for 
exact regularization is that the objective is linear and the 
constraint set is the intersection of a linear system with a 
closed polyhedral cone. LASSO ( l20b can be recast as 



minimize 

x^t,u,v 

subject to 



\\u\\ < V 

u = Ax — b , V ~ a 
x < t , ~x < t , 



(23) 



where 1„ G M" is the vector of ones, and (x, t, u, v) the vari- 
able. Although the objective of ( l23b is linear and its constraint 
set is the intersection of a linear system with a closed convex 
cone JC :~ {{u,v) : \\u\\ < v}, there is not a proof of exact 
regularization for ( |23] ), because JC is not polyhedral. However, 
experimental results in [35J suggest that exact regularization 
might occur for non-polyhedral cones. We will also see in the 
results of our simulations that approximating ( |20| ) with ( l22b 
can still yield very accurate solutions. 

Before computing the dual of (|22] |. we first introduce a new 
variable y E M™ and make the column partition explicit: 



minimize J2n=ii\\xp\\i 

x,y f 

subject to ||y|| < a 

y = Ep=i ApXp - b 



2 1 1 I P ) 



(24) 



The dual problem of (|24] |. by dualizing the last constraint only. 



g 



can be shown to be equivalent to 

p 

minimize ^^5p(A), 

^ p=i 

where gp(A) j,{b'^ X+a\\\\\)-M^^{\\xp\\i + {A], Xp + 
lll^plp) is the function associated to node p. Each function gp 
is convex because it is the pointwise supremum of convex 
functions |5 Prop. 1.2.4]. The problem each node has to solve 
in each iteration is addressed in Appendix IB1 

We point out that problem (|22] | can be ill-conditioned due 
to the small value that has to be used for S, if we want 
to get a good approximation of the original problem. In 
particular, the problem solved at each node can be very time- 
consuming, since it may require lots of internal iterations. We 
used 6 — 10^^ in our simulations. While such value for 5 
required a considerable amount of iterations for solving the 
internal problem of each node (we set 500 as the number 
of maximum iterations), it also allowed to get solutions of 
the original problem ( |20l i with relative errors of 0.1%, using 
few communications. BPDN, addressed next, handles the same 
type of problems as LASSO, but its distributed implementation 
does not require solving ill-conditioned problems. 

BPDN: row partition. In BPDN (l2Tl i. we assume that both 
the matrix A and the vector b are partitioned by rows: A = 
[Aj ■■■ Aj,]^,b:^[bJ ■■■ bj,]^. Therefore, dZB can 
be readily rewritten as 

p p 
minimize ^^^||j4j,a; — &p|p + — ||a;||i^ . (25) 
p=i 

The identification between ([T]i and (l25T l is immediate: 
set fp{x) := \\ApX — 6p|p + The problem that has 

to be solved at node p, 

minimize ||^pa; — 6„||'^ + -^||a;||i + v^a; + c||x||^ , (26) 

can be solved, for example, with the GPSR solver ||36l . 

Distributed support vector machines. In a Support Vector 
Machine (SVM) i37i Ch.7], we are given data in a matrix A e 
jgmxn^ where each row of A represents a data point in M". 
Each data point can be classified as belonging to group X 
or to group y. If data point belongs to X (resp. y), we 
set the ith entry of a vector d E M™ to +1 (resp. —1). The 
goal is to find the parameters (s, r) e M" x M of a linear 
separator z{a) — s^a + r such that z{a) > if a S A", 
and z{a) < if a G 3^. Once the pair (s, r) is found, we 
can attempt to determine if a new point a belongs to A" or 3^ 
just by analyzing the sign of z{a). There are several ways of 
finding (s,r); one of them is solving the quadratic program 

minimize llslP 

(27) 

subject to D{As — Im'") > Im , 

where D = Diag((ii, . . . ,dm), and 1„, is the vector of ones 
in M™. Many times, the data is distributed among several 
entities, and processing it in a distributed way is prohibitive 
for complexity or privacy reasons. Here, we assume the model 
of ll38l . ||T9l : there are P entities, and each entity p has nip 



data points, or equivalently, a block Ap E R™pX" of rows 
of A, and their corresponding classification, i.e., the pth diag- 
onal block Dp E K'"p><™p of D, such that mi + - • ■+mp = m. 
This enables us to rewrite dZTl i as 

minimize y^fLi PIP 

s.r ^ 

subject to Dp{ApS - Imr) > 1™ , p^l,...,P . 

(28) 

Problem (|28T l has the format of ([T): just set fp{s,r) ~ ||s|p 
and Xp = {(s,r) e M« x M : Dp{ApS - l„,,r) > 1„.J. The 
problem each node has to solve at each iteration is a quadratic 
program, which we solve with Matlab's quadprog function. 
Algorithm [3] was applied in |19| to solve a version of (l28l l that 
takes into account possible violations of the linear separability 
of the data. Here, for simplicity, we just consider (|28] | and 
assume the data can be separated Unearly. 

V. Experimental Results 

In this section, we show some simulation results of D- 
ADMM, and the algorithms from section |III1 applied to the 
problems just described. Of these applications, consensus is 
the one that assesses the algorithms the best, since neither the 
problem solution, nor the problem each node solves at each 
iteration, requires iterative algorithms or any kind of solver: 
they have closed-form solutions. We will, therefore, give more 
emphasis to consensus than to the other applications. We start 
with describing the network models. 

Network models. We generated networks according to the 
models described in Table |T] All networks were generated 
using the igraph package in R P3l . Il44l . Table |II] shows 
the 56 different networks we generated: given a model and 
fixed its parameters, we generated 8 networks with different 
number of nodes, ranging from 10 to 2000; we used 7 
different combinations of models and parameters. All net- 
works, except Lattice networks, were generated randomly, and 
we had to guarantee that all networks were connected (cf. 
Assumption O. So, every time we would get a non-connected 
network, we would generate another network with slightly 
different parameters, changed in the direction to make the 
next network connected with greater probability. Therefore, 
there were some exceptions to the parameters described in 
Table For example, network number 1 with 10 nodes was 
generated with p = 0.27, instead of with p = 0.25, or network 
number 6 with 10 nodes was generated with d = 0.36, instead 
of with d = 0.2. 

Table nil also shows the average node degree and the number 
of colors used for each network. The average node degree 
can be as low as 3, for example in the Lattice network 
with 50 nodes, or as high as 1499, in the Erdos-Renyi network 
with 2000 nodes. The number of colors each network was 
colored with is also varied. Note that only the Lattice networks 
were colored with 2 colors, and thus these are the only 
networks for which D-ADMM is proven to converge. 

We use all networks of Table HI] for the consensus problem, 
since we study that problem in more detail. To illustrate the 
algorithms for the remaining applications, we will only use 
the networks with 10 and 50 nodes. 
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Table I 
Network MODELS. 



Name Pai'ameters Description 



Erdos-Renyi 11391 


P 


Every pair of nodes {i,j} is connected or not with probability p 


Watts-Strogatz (40) 


{n,p) 


First, it creates a lattice where every node is connected to n nodes; then, it rewires every link with probability p. 
If link {i, j} is to be rewired, it removes the link, and connects node i or node j (chosen with equal probability) 
to another node in the network, chosen uniformly. 


Barabasi-Albert f4T\ 




It starts with one node. At each step, one node is added to the network by connecting it to 2 existing nodes: 
the probability to connect it to node p is proportional to Dp. 


Geometric l42l 


d 


It drops P points, corresponding to the nodes of the network, randomly in a [0, 1]^ square; then, it connects 
nodes whose (Euclidean) distance is less than d. 


Lattice 




Creates a lattice of dimensions m X n; m and n are chosen to make the lattice as square as possible. 



Table II 

Network parameters, average degree, and number of colors. 



Number Model Parameters Average degree (top). Number of colors (bottom) 



Number of nodes 





10 50 


100 200 500 700 1000 2000 


1 Erdos-Renyi 0.25 


3 12 
3 7 


26 49 125 174 250 499 
12 18 37 47 62 109 


2 Erdos-Renyi 0.75 


6 37 
5 19 


75 150 375 523 748 1499 
32 53 115 155 213 380 


3 Watts-Strogatz (2, 0.8) 


3 5 

4 4 


6 8 8 6 6 10 
5 5 6 5 5 7 


4 Watts-Strogatz (4, 0.6) 


5 7 
4 5 


8 8 8 10 8 10 
6 6 6 7 6 7 


5 Barabasi-Albert 


3 4 
3 3 


4 4 4 4 4 4 
4 4 4 4 4 4 


6 Geometric 0.2 


3 5 

4 8 


10 20 52 72 105 209 

11 17 35 46 64 120 


7 Lattice 


3 3 
2 2 


4 4 4 4 4 4 

2 2 2 2 2 2 



Choosing p. In section |III] we reviewed the distributed 
algorithms in Hterature that can solve ([T]l. All these algorithms, 
including D-ADMM, but excluding the gradient/subgradient 
method, are based on augmented Lagrangian duality. This 
means that, for each algorithm, we have to choose the 
parameter p of the augmented Lagrangian (see, e.g., ([8]l). 
We are unaware of any scheme that chooses this parameter 
in an optimal way before the execution of the algorithm. 
There are, however, schemes for updating it during the ex- 
ecution of the algorithm 1231 , ||45l , but it is not straightfor- 
ward to implement them in a distributed scenario. We will 
thus adopt the following scheme for choosing p: given a 
fixed network, we execute several times all the algorithms 
that depend on p, each for a different p, chosen from the 
set {10-4, 10-3, 10-2, 10-1, 10", 101,102}; then, we pick 
the p that leads to the best result, i.e., the least number 
of communication steps. This scheme was adopted in all 
experiments. 

Consensus. We now present simulation results for the 
consensus problem. The data was generated the following 
way: node p generates, independently of the other nodes, 
one number 9p from a Gaussian distribution with mean 10 
and standard deviation 100. We chose such a large standard 
deviation to make sure that the nodes' numbers differed 
significantly. 



Figure [3] shows the results of the simulations of D-ADMM, 
the algorithms described in section |III] and the additional 
algorithm from |26|, which uses the approach ( [TtI i. Note 
that [26] only solves the consensus problem and cannot be 
trivially generalized to solve Figure [^a)] depicts the num- 
ber of communication steps, which is equal to the number of 
iterations for the single-looped algorithms and to the number 
of inner iterations for the double-looped algorithms, for all the 
networks with 50 nodes. All algorithms stopped after reaching 
a relative error of 10-^%, i.e., ||a;'= - lp9*\\/{y/P\0*\) < 
10—*, or after reaching the maximum number of 1000 com- 
munication steps. The optimal solution 6* had been computed 
before. It can been seen from the plot of Figure [^a)] that D- 
ADMM is the algorithm requiring the least communication 
steps to converge, i.e., to achieve a precision of 10"^%, 
for all networks except the first two, where f26l| was better. 
Figurel ^fb)! shows the error evolution along the communication 
steps for network 3 (cf. with point 3 of Figure [ 3{a)| i. It can 
be observed that D-ADMM required always the least number 
of communication steps to reach any accuracy between 10% 
and 10-2%. 

We only consider the three best algorithms (D-ADMM, 
|fT6l , and ||26l ) for a more thorough analysis, in Figure |4l 
here, we simulate these algorithms in all the networks of 
Table |ll] In each of the plots of Figure |4] it is shown the 
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number of communication steps required to achieve 10^^% 
of accuracy as a function of the number of nodes. It can 
be seen that D-ADMM always required less communication 
steps than fTSl, except for the smallest network in almost all 
scenarios. D-ADMM also required less communication steps 
than ||26| . except for the Erdos-Renyi networks (Figures |4 Ua)| 
and| ^b)| i. f26\ was the only algorithm achieving the maximum 
number of 1000 iterations, which occurred for the Geometric 
and Lattice networks (Figures | ^f)| and | ^g)| i. We also note a 
curious fact from these plots: the number of communication 
steps is almost invariant to the size of the networks (for the 
considered range). 

We thus conclude that D-ADMM is a competing algorithm 
for solving the consensus problem, since it requires, in general, 
less communication steps to converge than other state-of-the- 
art algorithm, and it has approximately the same behavior for 
very different networks. 

BPDN. We also executed some experiments comparing D- 
ADMM and US (Algorithm O, the two best algorithms, 
solving BPDN (|2TI) . Figure |5] shows the results of such 
experiments, where we only used the networks with 50 nodes. 
For generating data, we used the Sparco toolbox |46|, namely 
problems with ID's 7 and 902. Problem with ID 7 contains 
a 600 X 2560 matrix A with Gaussian entries, and problem 
with ID 902 contains a 200 x 1000 partial DCT matrix. In both 
problems, we added a small amount of noise to the provided 
vector b, in order to make the application of i2T[ more realistic. 
The trade-off parameter f3 was set to 1 for problem ID 7 and 
to 0.3 for problem ID 902. We ran the algorithms until they 
reached a relative error of 10~^% or the maximum number of 
iterations 1000, only reached by fT6l in network 6. At each 
iteration, each node solved problem (l26T l with GPSR Ii36j . 

The results for problems with ID's 7 and 902 are shown 
respectively in Figures | ^a)| and | jjb)| These show the 
number of communication steps for each network model. 
Again, each point in the plots is the best result for p G 
{10"'*,10"3 ...,102}. It can be observed that D-ADMM 
always required less communication steps than |16|; however, 
there were some cases where both algorithms required almost 
the same communication steps, namely, for networks 2 and 3 



of Figure | ^a)[ and for networks 2 and 4 of Figure | jjb) 
The similarity of the shape of the curves of Figures | 3Ia)| 
and | jjb)| indicates that changing the network models causes 
more variability in the difficulty of the problem than the 
problem data. 

LASSO. The results of our experiments for LASSO (|20] | 
are shown in Figure |6] Again, we just compared the best 
two algorithms: D-ADMM and |16|. For the matrix A and 
vector b, we used the same data as in BPDN: problems 7 
and 902 from the Sparco toolbox. The bound for the noise 
was set to a ~ 0.5 for problem ID 7, and to a = 0.1 for 
problem ID 902. In all experiments, which were executed for 
the networks with P = 10 nodes, the regularizing parameter 
in (l22T i was always 5 — 10^^. With such a choice, both 
algorithms were able to achieve the relative error of 10^^% 
in all instances (the maximum number of iterations was never 
achieved). Note that the error was measured with respect to 
the solution of the original problem ( |20] |. which was obtained 
beforehand using the algorithm SPGLl li47l . 

In Figure |6] we can see that, in spite of solving a slightly ill- 
conditioned problem, D-ADMM and lfT6l required consider- 
ably few communication steps to converge. Also note that the 
behavior of the algorithms is approximately uniform over all 
the types of networks. In contrast with BPDN, it was changing 
the data what produced different curves: the curves for ID 902 
are on top of the curves for ID 7, which reveals that the former 
was harder to solve. For each problem, D-ADMM always took 
less communications steps than [i6| to converge. 

SVM. Quite surprisingly, the problem that required the 
largest number of communication steps was SVM. For this 
problem, we used data from the UCI machine learning reposi- 
tory Ii48i . namely the Iris dataset, which contains two clusters 
of points that are Unearly separable. Each point has n — A fea- 
tures and there are m = 100 total points. We first solved (l27T i 
with the Matlab function quadprog. Next, we executed D- 
ADMM and lfT6l . which stopped after achieving a relative 
error of 10~^% or lO'' communication steps. 

Figure |7] shows the results for the networks with 50 nodes. 
Again, D-ADMM required always less communication steps 
than [16] to converge; however, both algorithms required an 
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(f) Network 6: Geometric, d = 0.2 
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(g) Network 7: Lattice 
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Figure 4. Number of communication steps to achieve a 10~^% precision as a function of the number of nodes, for the consensus problem. The networks 
were generated according to Table llll and the algorithms compared are D-ADMM (Algorithmic), |T6| (Algorithm [5), and 1261 . Note that | (f ) | and | (g) | have a 
scale for the vertical axis different from the other plots. 
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Figure 5. Results for BPDN: communication steps to acliieve 10 ^% of accuracy in the networlcs witli 50 nodes. The dataset is from the Sparco toolbox; 
problems (a) ID 7, and (b) ID 902. 



amount of communications one order of magnitude larger 
than all the previous applications. And this happened with a 
relatively small problem. This might indicate that problem (ITTT i 
is inherently difficult to solve in a distributed way. 
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Figure 6. Results for LASSO: communication steps to achieve 0.1% of 
accuracy in the networks with 10 nodes. 
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Figure 7. Results for SVM: communication steps to achieve 10 ^% of 
accuracy in the networks of 50 nodes. 



VI. Conclusions 

We proposed an algorithm for solving separable problems 
in a distributed way. This means that, given a connected 



network, each node has a cost function and a constraint set 
associated. While keeping their costs and constraints sets 
private all the time, all nodes cooperate in order to solve 
a global optimization problem, which minimizes the sum of 
all the nodes' costs, and constrains the solution to be in the 
intersection of all sets. We showed how several problems from 
signal processing and control can be solved with the proposed 
algorithm, named D-ADMM, and presented extensive simula- 
tion results. These results show that D-ADMM requires almost 
always less communications to converge than any other state- 
of-the-art distributed optimization algorithm. 
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Appendix A 
Proof of Theorem[T] 

First, note that the assumptions imply that strong duality 
holds for (|4|. This follows directly from |5, Prop. 6. 4. 2] and it 
means that the optimal cost of (|4| and of (|9]l have the same 
(finite) value, which will be denoted by p*. It also means 
that (|9]l is solvable. Let then (a:*,a;5,A*) denote any primal- 
dual solution. We will use the notation p'^ = + g2{x2), 
= AiXi + A2X2- Using only strong duality and our 
assumptions on Xi and X2, we can prove that 

p* ~ p'^+i < A*^r'^+i (29) 

- p(A2(4+i - x'2)ViA2ix^+' - xl) - r'^+i) (30) 
yk+i < yk _ p\\rk+if _ p||A2(a;^+i - , (31) 

where is the Lyapunov function 

■.^-U'^ -yf + p\\A2{x'i-xi)\\\ 
p 

This is done in 123 il and we omit the proof here. In the 
course of that proof, the following fact is used: if </> and \1/ 
are convex functions and 5* is continuously differentiable, 
then X* E argmin-c {</!)( a; ) + '^'{x) : x € X} implies 
X* £ a.rgmma:{<j){x) + W'i>{x*)^ {x — x*) : x £ X}, for any 
closed convex set X. Using A*^ = A*^^^ + pr'' and applying 
the previous fact to (|5]l and (|6]l, we get, respectively, 

x'l g arg min gi{xi) + (A'' - ^^2(2:2 " X2^^))'^AiXi 

(32) 

X2 e arg min 52(2^2) + X''^ A2X2 . (33) 

X2&X2 

Limit points of {(x'l,x^)} are primal optimal. 

From dlTT l. we see that < V'' < V'^ for any k, and thus {A''} 
and {yl22:|} are bounded, showing that they have limit points. 
Since A2 has full column-rank, {0:2} is also bounded and thus 
has limit points. To show existence of limit points of {x'l}, 
we first have to show that —t' 0. Iterating (IJTT l we get 
00 

p^(||r^-+i|r + P2(x^+i-4-)ir)<V^", 

fc=0 

from which we conclude r*^ ^ and A2{x2^^ — X2) — > 
as fc ^ 0. Now, r'^ — Aix\ + ^22:2 — > 0, together with the 
boundedness of {^22:2}, shows that {Aix^^} is bounded and 
thus also {x\}, because Ai has full column-rank. Therefore, 
{x\} has limit points. 

We showed the existence of limit points of {(a^i , now, 
we show their optimality. In fact, taking fc — > 00 in (l29l l 
and (|30] |, using the boundedness of {A2a;2} ^"d {A'^}, and 
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the facts that r'^ — > and ^2(^2 — ) — )• 0, we conclude 
that p'' = giix'l) + g2ix^) P*- 

Limit points of {A*"} are dual optimal. We had seen 
that {A*^} has Hmit points because it is bounded. Let A 
be such Hmit point, and let K, be the set of indices such 
that {X'^jkeic -> A and also that {{x'i,x^)}keic -> (^1,^2), 
where (xi, X2) is a limit point of {{xi, )}. Now, define A*^ = 



„fc-i 



). Since A2{x2 - X2 ^) 



A'^ - pA2ixl 

have limfc->oo{A-j-fc6K = iim^^ 
the definition of Gi and G2, and (l32Ti-(l33Tl. we see that 



0, we 

{A'^lfceK = limfc^oo{A''}feeK = A. Now, using 



Gi(A")= inf + (A") ' Aixi 



5i(xf) + (A^)'Aia;f 



<5i(xi) + (A'=)^Aix 



(34) 
(35) 



G2(A'^)= inf .92(a;2) + (A*) ' ^2X2 

X2eX2 



= 52(^2') + (A")^^2a;^ (36) 
< 52(^2) + (A''^)^A2a;2, V,,ex. ■ (37) 
Adding (l34l l and ( |36] | and taking the limit k — > +00 (fc e /C), 



lim (Gi(A^-) + G2(A'=)) = V(5^(S0 + FA,x,) 

/C3fe — ^-oc 

i=l 

= p* = L{X*), (38) 

where the second-to-last equality follows from primal optimal- 
ity (and feasibility) of {xi,X2), and the last equality follows 
from strong duality. Adding (|35] | and dSTl i and taking the 
limit k +00 (k e /C), 

2 

lim (Gi(p-) + G2(A^-)) < y2{g^{xi) + ^ Ax,) , 

4=1 

for all Xi e Xi, i = 1, 2. In particular, taking the infimum on 
the right-hand side: 



lim (Gi(A'=)+G2(A'=)) <Gi(A) + G2(A), 

K3k — ^00 



(39) 



From ^ and_(l38]l we conclude that Gi(A) + G2(A) > L{X*), 
showing that A is dual optimal. 

The sequence {(a;^, a;2, A*")} converges. We have seen 
that {(a;^^, a;2, A*^)} has limit points and they are primal-dual 
optimal. Now, we will see that this sequence has only one 
limit point, and thus converges. Let (a;i,a;2,A) be a limit 
point of {(x^'j 2:2 , A*^)}. Since it is primal-dual optimal, (|29] |- 
(IJTT i hold with (x^,X2,A*) replaced by (xi,X2,A). We had 
seen that was convergent because it was bounded and 
nonincreasing. With the previous replacement, we now see that 
its limit is 0. Therefore, A''" A and Xj — > X2 (because A2 has 
full column-rank). Since r'^ — Ai(xi — xi)+A2(x2 — X2) — > 0, 
we also have x^ — >■ xi (because Ai has full-column rank), i.e., 
{(xf , X2, A*^)} converges. □ 



for some vector v and scalar c, and with gp{X) = -p(6^A + 
fillAII) - inf,^(||xp||i + (AjX^Xp + f ||xp|p). Define 



0p(A) 



inf \\xp\\i + {Aj X)^ Xp + -\\xp\\ 

Xp ^ 



(41) 



and let Xp(A) denote the solution of the optimization problem 
in (I4TI 1 for a given A. Since the objective in that problem 
is strictly convex, (j>p is differentiable and its gradient is 
given by V0p(A) = —ApXp{X). Furthermore, each component 
of Xp(A) can be found in closed-form: —{ri + l)/5 if Ti < — 1, 
— (ri — if Ti > 1, and otherwise, where denotes the 
ith row of Aj X. Problem ( |40l ) then becomes equivalent to 

minimize (/)p(A) + ^6^A + -^||A|| + w^A + c||Af , 

or introducing an epigraph variable t, 

minimize 0„(A) + j^b'^ X + j;t + w^A + c||A|p 

X,t J:^ 1^ 

subject to II A|| < t . 

Problem (02]) can be solved using a projected gradient method 
We will use FISTA |13, §4], whose complexity is 0(1/ k^). 



(42) 



FISTA solves 



minimize /(A) + 5(A) . 



(43) 



where / is a convex continuously differentiable function with 
a Lipschitz continuous gradient V/ with constant L (i.e., 

l|V/(?7) - V/(A)|| < L\\7j- All for any A,7,), and g is a 
convex function. FISTA consists of iterating A'^^^ = [A*"' — 
iV/(A'=))]+, where 



argmin g{z) + — - ?7||^ ■ 
z 2 



(44) 



We set /(A) to be the objective of (|42| |. and h{X) to be the 
indicator function of the set S := {{X,t) : \\X\\ < t}. In this 
case, (|44] | becomes the projection onto S, which can be shown 
to be 



[(A,t)]J 



(A,t) 
(0,0) 



\M\ 



t > 

t<-m 

-llxll < t < llxll 



The Lipschitz constant of V/ can also be shown to 
be a^^^{Ap)/d + 2c, where crmax(^p) is the largest singular 
value of A„. 



Appendix B 
Problem For Each Node in LASSO 

When LASSO (|20] | is solved in the column partition case, 
node p has to solve, in each iteration. 



minimize gp{X) + v A + c||A|| , 



(40) 



